The course learning objectives are:
Link to my GitHub repository for this course.
I prepared myself for this week's topic by completing the DataCamp exercise “Regression and model validation”.
For the data wrangling part of this week's exercises I mostly used the methods learned in the DataCamp exercise. I learned the basics of data wrangling in R, and I know how to find more methods/tools and how to use them to wrangle the data.
My source code used in the wrangling is here.
Basically, here are the main points of what is done in the code:
- str and dim are used to explore the input data.
- learning2014 is created as a new empty data.frame to store the required variables.
- gender, age, attitude and points are simply copied from the original dataset to learning2014.
- The combined question variables are scaled and copied to learning2014; the dplyr library is used here.
- Observations with points < 1 are filtered out from learning2014.
- The dataset (learning2014.txt) that the code created can be found from here.
With the dim function we learn the dimensions of the input dataset; the function returns the number of rows and columns.
dim(learning2014)
## [1] 166 7
With the str function we learn about the structure of the input dataset. Basically we get to know the number of observations (rows) and variables (columns) in the dataset. The output from the function lists the variables, the type of each variable vector and a few example values from each vector.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
I use ggpairs to show a detailed graphical overview of the input dataset, just like it was done in the DataCamp course for this week's exercises. ggpairs creates a matrix of plots from the given dataset. With the aesthetics mapping we instruct the colors to be based on the gender variable, and the transparency is controlled with the value given to the alpha parameter. The lower parameter controls the type of plots in the lower section of the plot matrix. All plots are between pairs of variables in the learning2014 dataset, and each is categorized based on the gender variable. Above the diagonal of density plots are the correlation coefficients of the pairs, both as a whole and categorized based on the gender variable.
From the plots we can see that nearly 2/3 of the observations are from females and thus 1/3 from males. When we look at the single variables categorized by gender, we see that the variable attitude deviates the most between the sexes. Also, females tend to be younger than males based on the variable age. There are also deviations between females and males in the combined question variables, where the surface question group deviates the most and strategy to some extent.
When we look at the correlation coefficients we see that points and attitude have the strongest positive correlation (0.437) among the dataset variables, roughly equally high for both females and males. The combined question variables deep and surf have the strongest negative correlation (-0.324), and among males the variables disagree (-0.622) clearly more than among females (-0.087). Then there are several pairs with similar mild positive correlation ((age, strategic), (deep, attitude) and (points, strategic)) and several pairs with similar mild negative correlation ((strategic, surface), (points, surface), (attitude, surface) and (age, surface)).
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
With the summary command we can see statistics of each variable in the dataset learning2014. For the categorical variable gender we only get the number of females and males. For each of the numerical variables we get the minimum, maximum, mean, median, and 1st and 3rd quartiles.
summary(learning2014)
## gender age attitude points deep
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00 Min. :1.583
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## surf stra
## Min. :1.583 Min. :1.250
## 1st Qu.:2.417 1st Qu.:2.625
## Median :2.833 Median :3.188
## Mean :2.787 Mean :3.121
## 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.333 Max. :5.000
From the summary below we can see the explanatory variables (attitude, age, stra) I used to build the model with the lm function. I reran the code several times with different combinations of variables to find the best-fitting variables for the model by comparing the summaries of each of the models. Residuals are the differences between the actual values of points and the predicted values of points. For most regressions the residuals should be normally distributed around zero, and here the median of the residuals (0.3303) is indeed close to zero. There is a shorthand for the significance of each variable at the end of its line in the Coefficients section of the summary: the more stars at the end of the line, the more significant the variable is, and a dot is still quite good. Here we have *** for attitude and . for both stra and age. The p-value tells the probability of observing an effect at least this large if the variable were NOT relevant, so small values here are good. As we can see, we have a very low value for attitude and still fairly low values for both stra and age.
The R-squared value tells how close the data are to the fitted regression line. It indicates the percentage of the variability of the response data around its mean that the model explains: 0% when none and 100% when all of the variability is explained. Here we have ~21%. In general, the higher the R-squared, the better the model fits the data, but a low R-squared value is not always bad. For example, when trying to predict human behavior, R-squared values are typically lower than 50%, and here we have a model fitted to variables describing mostly (results of) human behavior. Also, if R-squared is low but the variables are statistically significant, important conclusions can still be drawn. And here we have a low R-squared with statistically significant variables.
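The model discussed above can be fitted with lm and inspected with summary. A minimal sketch; since the course data is not bundled here, a synthetic stand-in with the same column names is used, and my_model is a placeholder name:

```r
# Synthetic stand-in for learning2014 (placeholder values, NOT the course data),
# so the sketch runs on its own
set.seed(42)
learning2014 <- data.frame(
  points   = round(rnorm(166, mean = 22.7, sd = 5.3)),
  attitude = round(rnorm(166, mean = 31.4, sd = 7.0)),
  stra     = rnorm(166, mean = 3.1, sd = 0.7),
  age      = round(rnorm(166, mean = 25.5, sd = 7.0))
)

# Fit the linear regression with three explanatory variables and inspect it
my_model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(my_model)
```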
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Model assumptions describe the data generating process: the better the assumptions fit reality, the better the model describes the phenomenon of interest. With linear regression, linearity is assumed: the target is modelled as a linear combination of the model parameters. It is usually also assumed that the errors are normally distributed, that the errors are uncorrelated and have constant variance, and that the size of a given error does not depend on the explanatory variables.
With a QQ-plot of the residuals we can explore the assumption that the errors of the model are normally distributed. Here the 2nd plot is the QQ-plot of the model, and we can see that the majority of the residuals fit well to the line and thus to the normality assumption.
The constant variance assumption can be explored with a simple scatter plot of residuals versus model predictions. Any pattern in the scatter plot implies a problem with the assumption. Here the scatter plot is the 1st plot, and I must say it is a bit difficult to tell whether there is a pattern (problem), especially at the lowest and highest fitted values. In between, the residuals are well scattered.
Leverage measures how much impact a single observation has on the model, and the residuals vs. leverage plot helps identify observations with high impact. Here the last plot is the residuals vs. leverage plot. Again it is a little hard for me to tell, but my interpretation is that there is no single outstanding observation that would drag the model.
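The three diagnostic plots discussed above come from R's plot method for lm objects. A sketch; the model here is a placeholder fitted on synthetic data so the example is self-contained:

```r
# Placeholder model on synthetic data (not the course model)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
my_model <- lm(y ~ x, data = d)

# which: 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))
```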
I prepared myself for this week's topic by completing the DataCamp exercise “Logistic Regression”.
As I did last week, for the data wrangling part of this week's exercises I mostly used the methods learned in the DataCamp exercise. I learned about joining datasets and doing data mutations.
My source code used in the wrangling is here.
Basically, here are the main points of what is done in the code:
- The two datasets are joined using the variables school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery and internet.
- The alc_use variable was generated to indicate alcohol use (average of daily and weekly consumption).
- The high_use variable was generated to indicate high use of alcohol (true if alc_use > 2).
- The dataset (alc.csv) that the code created can be found from here.
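A rough base-R sketch of the join-and-mutate logic described above. The tiny data frames here are placeholder stand-ins for the two student datasets, and only a subset of the 13 join columns is used; the actual wrangling code uses dplyr on the full data:

```r
# Placeholder stand-ins for the mathematics and Portuguese datasets
mat <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                  age = c(18, 17, 16), Dalc = c(1, 2, 3), Walc = c(1, 4, 5))
por <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                  age = c(18, 17, 16), Dalc = c(1, 3, 3), Walc = c(2, 4, 4))

# Inner join on the identifying columns
alc <- merge(mat, por, by = c("school", "sex", "age"), suffixes = c(".mat", ".por"))

# Average the duplicated consumption answers, then derive alc_use and high_use
alc$Dalc <- (alc$Dalc.mat + alc$Dalc.por) / 2
alc$Walc <- (alc$Walc.mat + alc$Walc.por) / 2
alc$alc_use <- (alc$Dalc + alc$Walc) / 2
alc$high_use <- alc$alc_use > 2
alc[, c("school", "sex", "age", "alc_use", "high_use")]
```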
Step one was to create chapter3.Rmd to inject this content into the diary as a child document in index.Rmd.
The combined dataset contains data from two Portuguese schools, originally gathered from school reports and questionnaires. The data was initially collected in two datasets, one regarding student performance in mathematics and one regarding student performance in Portuguese language. In addition to the section Data Wrangling, in which the join criteria and the additionally created variables are discussed, you can find more information about the variables here.
Dimensions and the column (variable) names of the combined dataset:
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Column (variable) types of the combined dataset:
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
My hypothesis to study is how time consumption is related to alcohol consumption, the idea being that if you spend your time on activities you would not have time to consume alcohol, and vice versa. My chosen variables are, in a sense, time-consuming variables: freetime, goout, studytime and traveltime.
- traveltime - Home to school travel time. My intuition is that this would be a factor to decrease alcohol consumption.
- studytime - Weekly study time. My intuition is that this would be a factor to decrease alcohol consumption.
- freetime - Free time after school. My intuition is that this would be a factor to increase alcohol consumption.
- goout - Going out with friends. My intuition is that this would be a factor to increase alcohol consumption.

Here I use ggpairs, familiar from last week, to visualize the chosen variables together with the variable high_use. From the diagram we quickly get a good overview of the relations of the variables between themselves and with the variable high_use. Looking at the correlation coefficients, we can observe that there is not much correlation between the chosen variables. We could say that freetime and goout mildly correlate positively, meaning that you go out with friends during your free time, which makes common sense. Also, traveltime and studytime mildly correlate negatively, as do freetime and studytime, meaning that travel time between school and home, and free time, are not used for studying. My interpretation is that the low correlation between the chosen variables could indicate that the variables do not interfere with each other and can each independently be a factor to increase or decrease alcohol consumption. That can be seen as being in favor of my hypothesis.
From the density plots we can see that there are some deviations for each of the chosen variables when comparing high and low alcohol usage. The most distinctive difference is between goout and high_use, where we could say that those who go out more have a tendency toward high alcohol consumption. Comparing the freetime and high_use densities, one could argue that with less free time there are fewer observations of high alcohol consumption, and these grow with more free time. A similar argument can be made for studytime and high_use, where there is less high alcohol consumption when more time is spent on studies. And comparing traveltime and high_use, one could say that if you live close to school you have more time to consume alcohol. In my interpretation these still support my hypothesis, but it is not so clear, because the densities for low/high alcohol consumption per chosen variable are very similar: there are 1/3 high-consuming versus 2/3 low-consuming observations, and the values of the attributes are discrete and few.
Looking at the box and bar plots of each chosen variable divided by alcohol consumption, we can see that traveltime and freetime are quite identically distributed in terms of both high and low alcohol consumption. Looking at high alcohol consumption with freetime, it seems that high alcohol consumption increases more as free time increases, whereas low alcohol consumption increases less.
With studytime and goout the distributions differ more between high and low alcohol consumption. Looking at high alcohol consumption with studytime, high alcohol consumption increases more as studytime decreases, compared to low alcohol consumption. Likewise, high alcohol consumption increases more as time spent going out with friends increases, compared to low alcohol consumption.
Here we use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable.
##
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5805 -0.7670 -0.5270 0.8878 2.4724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.9531 0.7013 -4.211 2.54e-05 ***
## studytime -0.5612 0.1655 -3.391 0.000697 ***
## goout 0.7206 0.1212 5.947 2.73e-09 ***
## traveltime 0.3619 0.1752 2.066 0.038868 *
## freetime 0.0921 0.1354 0.680 0.496389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 396.18 on 377 degrees of freedom
## AIC: 406.18
##
## Number of Fisher Scoring iterations: 4
From the summary we can see the distribution of deviance residuals for the individual cases used in the model. The table of coefficients shows the coefficient and its standard error for each variable. The coefficients tell the change in the log odds of the outcome for a one-unit increase in the predictor variable (e.g. a one-unit change in goout increases the log odds of high_use by 0.7206). The p-values tell the probability of observing an effect this large if the variable were NOT relevant, so low values here are good. From the significance indicators we can tell that studytime, goout and traveltime are statistically significant, but freetime is not.
The odds ratios of the coefficients below quantify the association of each chosen variable with high alcohol consumption (with high_use defined as binary, TRUE/FALSE). Since both the odds ratio and the confidence interval of studytime are below 1, it is associated with low alcohol consumption. Both goout and traveltime are above 1, so they are associated with high alcohol consumption. The interval of freetime includes the value 1, meaning we cannot tell to which side that variable leans.
## OR 2.5 % 97.5 %
## (Intercept) 0.0521788 0.01274364 0.2006731
## studytime 0.5705061 0.40821453 0.7826192
## goout 2.0556712 1.63108393 2.6258521
## traveltime 1.4359821 1.01912982 2.0303214
## freetime 1.0964726 0.84079237 1.4317362
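Odds ratios and confidence intervals of this kind can be computed by exponentiating the model coefficients and their confidence intervals. A sketch on synthetic placeholder data with the same column names (the course code uses the alc data):

```r
library(MASS)  # provides the profile-likelihood confint() method for glm objects

# Placeholder data, NOT the course data
set.seed(2)
d <- data.frame(high_use   = rbinom(200, 1, 0.3),
                studytime  = sample(1:4, 200, replace = TRUE),
                goout      = sample(1:5, 200, replace = TRUE),
                traveltime = sample(1:4, 200, replace = TRUE),
                freetime   = sample(1:5, 200, replace = TRUE))
m <- glm(high_use ~ studytime + goout + traveltime + freetime,
         data = d, family = "binomial")

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # 95% confidence intervals on the odds-ratio scale
cbind(OR, CI)
```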
Based on the summary and odds ratios I conclude that studytime, goout and traveltime support my hypothesis, but surprisingly freetime does not. And since freetime is not statistically significant, I will drop that variable from the further parts of this exercise analysis.
First we need to re-fit the model, leaving out the statistically non-significant freetime variable.
##
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5318 -0.7679 -0.5214 0.8604 2.4526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6967 0.5878 -4.588 4.47e-06 ***
## studytime -0.5716 0.1650 -3.464 0.000532 ***
## goout 0.7431 0.1172 6.343 2.25e-10 ***
## traveltime 0.3559 0.1746 2.039 0.041486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 396.64 on 378 degrees of freedom
## AIC: 404.64
##
## Number of Fisher Scoring iterations: 4
Then we use the model to predict probabilities for the data that was used to train the model. Based on the probabilities, we make predictions for the training data observations so that probability > 0.5 means high alcohol consumption and otherwise low. In the first table below we can see the counts of the true values of high_use against the predictions, and in the second table the proportions of the same. From these we can compute that the model gets it right (0.63874346 + 0.11780105 = 0.75654451) ~76% of the time, meaning a training error of ~24%. The model gets ~91% of the true low alcohol consumption observations correct, but only ~39% of the true high alcohol consumption observations.
## prediction
## high_use FALSE TRUE Sum
## FALSE 244 24 268
## TRUE 69 45 114
## Sum 313 69 382
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63874346 0.06282723 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
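Tables like the two above can be produced by classifying the predicted probabilities at the 0.5 cutoff and cross-tabulating. A sketch on placeholder data (addmargins adds the Sum row and column):

```r
# Placeholder data and model, so the sketch runs on its own
set.seed(3)
d <- data.frame(high_use   = rbinom(300, 1, 0.35) == 1,
                studytime  = sample(1:4, 300, replace = TRUE),
                goout      = sample(1:5, 300, replace = TRUE),
                traveltime = sample(1:4, 300, replace = TRUE))
m <- glm(high_use ~ studytime + goout + traveltime, data = d, family = "binomial")

d$probability <- predict(m, type = "response")  # predicted probabilities
d$prediction  <- d$probability > 0.5            # classify at the 0.5 cutoff

tab <- table(high_use = d$high_use, prediction = d$prediction)
addmargins(tab)              # counts with marginal sums
addmargins(prop.table(tab))  # the same as proportions
```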
As a performance comparison, against a simple guessing strategy with weights 2/3 for low alcohol consumption and 1/3 for high alcohol consumption, we can see that the model beats this guessing strategy. The results of the simple guessing strategy are below: it gets ~59% correct of the true totals, ~70% of the true low alcohol consumption and ~32% of the true high alcohol consumption. (Note: since the guessing strategy is based on a randomizer, the percentages written above may change slightly each time this R Markdown code is rerun.)
## prediction
## high_use 0 1 Sum
## FALSE 187 81 268
## TRUE 77 37 114
## Sum 264 118 382
## prediction
## high_use 0 1 Sum
## FALSE 0.48952880 0.21204188 0.70157068
## TRUE 0.20157068 0.09685864 0.29842932
## Sum 0.69109948 0.30890052 1.00000000
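The guessing benchmark itself can be sketched with sample(), drawing a guess for each observation with weights 2/3 and 1/3 (the truth values here are placeholders):

```r
set.seed(4)
n <- 382
actual <- rbinom(n, 1, 0.3) == 1  # placeholder for the true high_use values

# Guess "high use" with probability 1/3 and "low use" with probability 2/3
guess <- sample(c(FALSE, TRUE), size = n, replace = TRUE, prob = c(2/3, 1/3))

table(high_use = actual, prediction = guess)
mean(actual == guess)  # share of correct guesses
```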
Here is also a graphical visualization of the actual and predicted values.
For the 10-fold cross-validation we need to define a loss function. I use the same definition of the loss function as was used in the DataCamp exercises. Here is the error rate of the above-trained model computed with the loss function.
## [1] 0.2434555
And here is the error rate of the trained model with 10-fold cross-validation. As you can see, the model slightly outperforms the model used in the DataCamp exercise.
## [1] 0.2539267
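The loss function and the cross-validation can be sketched as follows. The loss function definition matches the DataCamp one; the data and model here are synthetic placeholders, and boot::cv.glm does the K-fold work:

```r
library(boot)  # cv.glm for K-fold cross-validation

# Mean prediction error at the 0.5 probability cutoff (as on DataCamp)
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# Placeholder data and model
set.seed(5)
d <- data.frame(high_use   = rbinom(300, 1, 0.3),
                studytime  = sample(1:4, 300, replace = TRUE),
                goout      = sample(1:5, 300, replace = TRUE),
                traveltime = sample(1:4, 300, replace = TRUE))
m <- glm(high_use ~ studytime + goout + traveltime, data = d, family = "binomial")

train_err <- loss_func(class = d$high_use, prob = predict(m, type = "response"))
cv_err <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)$delta[1]
c(training = train_err, cv = cv_err)
```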
In this part of the exercise set I used only variables of numeric type (type: int); there are 14 such variables. I use a for loop to add predictors one by one and do the cross-validation step to store the error at each stage. Unfortunately my plotting skills are poor and I was not able to flip the value labels of the x-axis (I was able to flip the “curve” but decided not to, since it would make things more confusing), hence the errors are presented in the plot in order of increasing number of predictors.
## [1] 0.2958115 0.2958115 0.2958115 0.2879581 0.2879581 0.2879581 0.2879581
## [8] 0.2958115 0.2565445 0.2670157 0.2539267 0.2513089 0.2486911 0.2382199
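The add-predictors-one-by-one loop can be sketched like this. The predictors x1..x5 are synthetic placeholders (the course code loops over the 14 integer-typed variables of alc); selecting columns with `high_use ~ .` keeps the formula literal, which cv.glm refits cleanly:

```r
library(boot)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Placeholder response and predictors
set.seed(6)
d <- data.frame(high_use = rbinom(200, 1, 0.3),
                matrix(rnorm(200 * 5), ncol = 5,
                       dimnames = list(NULL, paste0("x", 1:5))))
predictors <- paste0("x", 1:5)

cv_errors <- numeric(length(predictors))
for (i in seq_along(predictors)) {
  # keep high_use plus the first i predictors; "high_use ~ ." uses all of them
  cols <- c("high_use", predictors[1:i])
  m <- glm(high_use ~ ., data = d[, cols], family = "binomial")
  cv_errors[i] <- cv.glm(data = d[, cols], cost = loss_func,
                         glmfit = m, K = 10)$delta[1]
}
cv_errors
```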
Also for this week I prepared myself by completing the DataCamp exercise “Clustering and classification”.
Step one again was to create chapter4.Rmd to inject its content into the actual course diary as a child document in index.Rmd.
In this exercise we use the Boston dataset from the MASS package for R. The dataset contains “Housing Values in Suburbs of Boston” according to the dataset description here, where you can also find details about the dataset columns.
The Boston dataset is of class data.frame. The dimensions of the dataset are 506 observations (rows) with 14 variables (columns).
Structure of the dataset is listed below.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
First, with a summary, together with the additional information available here, we can take a look at the different scales of each of the variables. There are different rates, ratios, proportions, averages, medians and a dummy (boolean indicator) used as values for the different variables.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Second, we can look at the density plots of each variable. I was tempted to draw relationship conclusions already based on these diagrams, but as we can see from the correlation matrix (coming up next), things are not so obvious.
Third, with a type of correlation matrix we can observe the relationships between the variables. I used the mixed version of corrplot, which was introduced in the DataCamp exercises. From it you can visually locate the variables with positive (→ 1) and negative (→ -1) relations in the top triangle and check their coefficients in the lower triangle. We can observe that the variable pair (rad, tax) has the highest positive relationship (> 0.9). Also the pairs (indus, nox), (indus, tax), (nox, age) and (rm, medv) have a high positive relationship (> 0.7). On the high negative relationships (< -0.7) we can mention the pairs (indus, dis), (nox, dis), (age, dis) and (lstat, medv). Also noteworthy is that the variable chas does not seem to correlate positively or negatively with any other variable in the dataset.
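The correlation matrix itself comes from cor(). A sketch; the corrplot call is guarded since that package may not be installed everywhere:

```r
library(MASS)  # for the Boston data

cor_matrix <- round(cor(Boston), 2)  # pairwise correlation coefficients
cor_matrix["rad", "tax"]             # the strongest positive pair noted above
cor_matrix["lstat", "medv"]          # one of the strong negative pairs

# Mixed plot: circles in the upper triangle, coefficients in the lower one
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot.mixed(cor_matrix)
}
```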
After scaling, each variable has mean 0 (or very close) and variance 1, so the values are centered around zero on a common scale. This can be observed from the summary below and from the density plots after it (compare to the similar density plots above).
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
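Standardization of this kind is done with scale(). A minimal sketch:

```r
library(MASS)  # for the Boston data

# scale() subtracts each column mean and divides by the column standard deviation
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)  # the mean is now 0
```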
Then, based on the quantiles of the scaled crime rate variable crim, a categorical variable is created. The original variable crim is removed and the newly created variable crime is added to the data set. Here is the structure of the modified dataset.
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
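The categorical crime variable can be created with quantile() and cut(). A sketch of the step described above:

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# Quantile break points of the scaled crime rate define the four categories
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Drop the original crim variable and add the new factor variable
boston_scaled <- boston_scaled[, colnames(boston_scaled) != "crim"]
boston_scaled$crime <- crime
str(boston_scaled)
```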
The dataset is split in two, so that 80% of the data is in the train set and 20% in the test set.
## [1] 404 14
## [1] 102 14
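The 80/20 split can be sketched with sample() on the row indices (the seed here is an arbitrary assumption, so the exact rows chosen differ from the course run):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # arbitrary seed for reproducibility
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))  # 80% of the row indices

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
dim(train)
dim(test)
```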
Here the LDA model is fitted with the train set and the model summary is printed. We can see that LD1 covers more than 0.95 of the between-group variance. Next the LDA is plotted as a biplot. At first the arrows did not seem to work on the same scale as in DataCamp, and in order to get even some of the arrows visible I needed to scale them way more (the number of discriminants did not have any effect on this). Update (Tue 27th): found the issue; it was not the arrows but the wrong data being plotted.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2376238 0.2623762 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.93520882 -0.8907091 -0.11484506 -0.8461428 0.4427384
## med_low -0.08995672 -0.2868547 -0.02626030 -0.5436521 -0.1974597
## med_high -0.38045185 0.1548430 0.17338039 0.3563715 0.1188784
## high -0.48724019 1.0171096 -0.04073494 1.0719641 -0.4264826
## age dis rad tax ptratio
## low -0.8504039 0.8282951 -0.6959029 -0.7237406 -0.44783279
## med_low -0.3196882 0.3639736 -0.5488034 -0.4678598 -0.02228321
## med_high 0.3883868 -0.3655824 -0.3870520 -0.2863228 -0.29451506
## high 0.7836698 -0.8384347 1.6382099 1.5141140 0.78087177
## black lstat medv
## low 0.37713478 -0.75166181 0.49724279
## med_low 0.30962322 -0.09439842 -0.04649268
## med_high 0.08005809 -0.02617694 0.19348023
## high -0.69571287 0.87641107 -0.65573310
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12516906 0.86966092 -0.94732196
## indus -0.02273021 -0.17179618 0.18549215
## chas -0.08212910 -0.07022140 0.07547652
## nox 0.42388257 -0.65600866 -1.21121635
## rm -0.11735441 -0.02662857 -0.24806326
## age 0.30134507 -0.32963416 -0.16565107
## dis -0.05014585 -0.37462365 0.32242714
## rad 2.98404790 1.05684898 -0.14161191
## tax 0.03103018 -0.26536574 0.64453024
## ptratio 0.11510171 0.05685743 -0.19787389
## black -0.13853306 0.04292244 0.11162022
## lstat 0.19938790 -0.23811448 0.41562347
## medv 0.19314756 -0.46312460 -0.11590396
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9479 0.0384 0.0137
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
# fixed this part
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 3)
The LDA model fitted in the previous step is used to predict the crime class for the test data after the true classes are removed from it. The predictions are then compared against the known true values in the cross-tabulation below. On the test data the class high is predicted perfectly (25/25). The classes low and med_low fare worst, with roughly half of their observations misclassified (15/27 and 16/30 correct, respectively). The class med_high does somewhat better, with 14/20 correct predictions.
## predicted
## correct low med_low med_high high
## low 15 12 0 0
## med_low 6 16 8 0
## med_high 1 5 14 0
## high 0 0 0 25
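A sketch of the prediction step (assuming the `lda.fit` and `test` objects from above):

```r
library(dplyr)
correct_classes <- test$crime           # save the true classes
test <- dplyr::select(test, -crime)     # remove them from the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross-tabulate the true classes against the predictions
table(correct = correct_classes, predicted = lda.pred$class)
```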
First the Boston dataset is reloaded and scaled again.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
Then the Euclidean distances between the scaled observations are computed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
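A sketch of the reload, scale and distance steps (variable names are my choices):

```r
library(MASS)
data("Boston")
# scale() returns a matrix, so convert back to a data frame
boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_scaled)   # Euclidean distance is the default
summary(dist_eu)
```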
A simple initial run of the K-means algorithm with a randomly chosen number of cluster centers (5). The pairs plot is hard to read when all variables are compared with each other; splitting the data for the pairs plot makes it more readable, but then the comparisons between all variables are lost.
To determine the optimal number of clusters, the total WCSS (within-cluster sum of squares) can be used as a guide. Here I use 14 as the maximum number of clusters, based simply on the number of variables in the dataset, and plot the tWCSS results. The optimal number of clusters is where the tWCSS drops dramatically, and here we get the same result (2 clusters) as in the DataCamp exercise.
The K-means algorithm is run again, now with 2 clusters as suggested by the tWCSS. The results are plotted with pairs to see all variables compared, split into the two clusters.
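The tWCSS search and the final fit can be sketched like this (assuming `boston_scaled` from above; the `set.seed` call is my addition, since K-means starts from random centers):

```r
set.seed(123)
k_max <- 14   # maximum number of clusters to try
# total within-cluster sum of squares for k = 1..14
twcss <- sapply(1:k_max,
                function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:k_max, twcss, type = "b")   # look for the sharp drop
# final run with the suggested 2 clusters
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
```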
I found the instructions for this bonus exercise a little ambiguous, so I took some freedom since it is not clearly stated how some of the things should be done.
So, first I took the Boston dataset and scaled it. Did nothing to crim, no categorization.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then I ran K-means on the scaled data with 3 clusters specified and show the pairs plot. I took the cluster vector from the results and added it as a new column to the original scaled data.
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ cluster: int 1 1 1 1 1 1 1 1 1 1 ...
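A sketch of the steps above (`km3` is a name I chose for illustration):

```r
# 3-cluster K-means on the scaled Boston data
km3 <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled, col = km3$cluster)
# add the cluster assignment as the 15th variable, as in str() above
boston_scaled$cluster <- km3$cluster
```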
I did the same split into train and test datasets as we did earlier, so that I can see how well the model predicts. The most influential linear separators are medv and chas.
## Call:
## lda(cluster ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3
## 0.5148515 0.3391089 0.1460396
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3941546 0.31227411 -0.6370635 -0.27232907 -0.65912810 -0.01684603
## 2 0.7549052 -0.48724019 1.1552802 -0.09990132 1.11285090 -0.46105942
## 3 -0.3149865 0.02365179 -0.3280701 1.39593374 -0.09317372 1.15993858
## age dis rad tax ptratio black
## 1 -0.6086596 0.6311264 -0.6003372 -0.6010240 -0.07361837 0.3163746
## 2 0.8059442 -0.8487400 1.0652503 1.1624509 0.51852048 -0.5592621
## 3 0.2613110 -0.3401229 -0.3667601 -0.5552257 -0.97842899 0.2623650
## lstat medv
## 1 -0.3954831 0.09021249
## 2 0.9137863 -0.77094599
## 3 -0.6633176 1.38467393
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.01117325 0.09031124
## zn 0.42825390 -0.07804598
## indus 1.27189292 0.21207572
## chas -0.29464473 0.82381171
## nox 0.81644143 -0.31106867
## rm 0.14381601 0.56471935
## age -0.11369692 0.51461099
## dis 0.12926754 -0.35989536
## rad 0.70042482 0.35646727
## tax 0.31049220 -0.22916058
## ptratio 0.06106314 -0.40680906
## black -0.04908228 -0.05389697
## lstat 0.32333200 0.40960024
## medv -0.16431904 0.87258292
##
## Proportion of trace:
## LD1 LD2
## 0.7556 0.2444
## predicted
## correct 1 2 3
## 1 57 1 0
## 2 0 33 0
## 3 2 1 8
And we can see that the prediction is quite precise. But it remains unclear what these 3 classes should represent with this data (whereas above we were predicting crime).
In this part two datasets needed to be read from the internet. Then the column names were renamed in both datasets, 2 new variables were created, and the data was combined into one dataset and stored. I had some problems when reading the stored data back: I did not get the same number of observations as in the file (I got 164 instead of the 195 in the file). I did not see anything obviously wrong in the file, but after I set quote back to its default value (TRUE) when saving the data, the correct number of observations could be read.
The code for the data can be found here.
And the data can be found here.
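The quote fix can be sketched like this (`human` and the file name are placeholders for the actual objects):

```r
# save with the default quote = TRUE so that strings containing the
# separator get quoted and each row keeps the right number of fields
write.table(human, file = "human.txt")
human2 <- read.table("human.txt")
nrow(human2)   # should now match the observation count in the file
```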
This week we took a look at dimensionality reduction methods, namely principal component analysis (PCA) and multiple correspondence analysis (MCA).
There was the usual DataCamp exercise as an introduction to dimensionality reduction techniques.
The rest of the exercises are presented in this chapter of the diary.
In the data wrangling part we used the combined dataset from last week's exercise and did some further manipulations to it.
The script for the data mutilation is here.
And the resulting data is here.
As I had previously stored my data file as .csv, I noticed that when reading the file with read.csv2, numeric variables (i.e. integers) are interpreted as factors. One then needs to convert those strings back to numeric form. The other way around is to use read.table, in which case such variables are "automagically" interpreted as numerics.
First I just use read.table and show the structure of the file.
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
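The two reading approaches can be sketched like this (the file names are placeholders):

```r
# read.table interprets numeric columns as numeric directly
human <- read.table("human.txt")
str(human)

# with read.csv2 some numeric columns may come in as factors, e.g.:
# human <- read.csv2("human.csv")
# human$GNI <- as.numeric(as.character(human$GNI))
```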
Next I visualize the data with ggpairs and corrplot.mixed. From the ggpairs plot we can observe the density of each variable. The correlation coefficients are in the upper part of the plot, but they are a little difficult to read. The scatter plots in the lower part show the pairwise correlations.
The corrplot.mixed shows nicely the pairwise correlation of each variable pair. The two most similar variable pairs are (Edu.Exp, Life.Exp) and (Mat.Mor, Ado.Birth), and the two most dissimilar variable pairs are (Life.Exp, Mat.Mor) and (Mat.Mor, Edu.Exp).
From the summary we can see that there are clear differences in the scales of the values when comparing the variables. We can also observe the variance of each variable's values.
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
First the PCA is performed on the non-standardized human data (summary of the non-standardized data above). The table below is the summary of the performed PCA, showing that the first principal component (PC1) captures almost all, ~99.99%, of the variance in the non-standardized data.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
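A sketch of the PCA call behind the summary above (assuming the data is in `human`):

```r
# PCA on the raw (non-standardized) data with prcomp
pca_human <- prcomp(human)
# PC1 dominates because GNI has by far the largest scale
summary(pca_human)
```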
Next the PCA is performed on the standardized human data (summary of the standardized data below).
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
The summary of the PCA performed on the standardized data is shown in the table below. The first principal component (PC1) captures roughly 53% and the second principal component (PC2) roughly 16% of the variance in the data, so almost 70% of the variance is captured by the first two principal components. The results between the PCA on non-standardized and standardized data differ a lot because the PCA method treats variables with larger variance as more important, and hence the method is sensitive to the relative scaling of the original variables.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
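A sketch of the standardize-then-PCA step, with the biplot drawn for the first two components (the colors are my choices):

```r
# standardize the variables, then repeat the PCA
human_std <- scale(human)
pca_std <- prcomp(human_std)
summary(pca_std)
# biplot of the observations and the variable arrows on PC1/PC2
biplot(pca_std, choices = 1:2, cex = c(0.6, 0.8),
       col = c("grey40", "deeppink2"))
```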
From the biplot below we can see which variables correlate with each other and with which PC. Group 1, with Ado.Birth and Mat.Mor, point in the same direction with a small angle between them, meaning they have a strong positive correlation. Group 2, with Parli.F and Labo.FM, also point in the same direction, meaning they have a strong positive correlation. Group 3, with GNI, Edu.Exp, Life.Exp and Edu2.FM, point in the same direction with small angles between each other, meaning they have strong positive correlations. Group 2 is almost orthogonal to both Group 1 and Group 3, meaning the Group 2 variables do not correlate with the variables of the two other groups. Groups 1 and 3 point in opposite directions, meaning there is a negative correlation between the variables in those groups. Also, the Group 2 variables are almost parallel with the second principal component and Groups 1 and 3 with the first, meaning those groups load mainly on those PC's respectively.
My initial interpretation/understanding of PCA follows from the course material. It is a dimension reduction method for high-dimensional data: it captures the most important dimensions (variables) of a high-dimensional dataset as new (principal) components. Each PC is a linear combination of the original dimensions that captures variance in the dataset; the first PC captures the most of the variance (information), and each consecutive PC captures the most of the remaining variance while being uncorrelated with the previous components. The biplot above and the table below show the original variables contributing to each of the two principal components: PC1 is contributed to by the variables in Groups 1 and 3, and PC2 by the variables in Group 2.
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
(Sorry, couldn’t resist a joke on YMCA.)
By loading the tea dataset from the FactoMineR package and observing it, we can see that the dataset holds 300 observations of 36 variables. 35 variables are factors and only one variable (age) is numerical. Age is also represented as the categorical variable age_Q in the dataset.
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
I removed the variable age from the dataset and used the rest of the variables for the visualization below.
## [1] 300 35
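A sketch of the age removal (`tea_time` is a name I chose for illustration):

```r
library(FactoMineR)
library(dplyr)
data(tea)
# drop the numeric age column, keeping the 35 factor variables
tea_time <- dplyr::select(tea, -age)
dim(tea_time)   # 300 35
```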
For the MCA I consider only the variables with binary values, for the sole reason that when running the MCA with all variables the dimensions are not 1:1 with the variables: each extra factor level becomes its own dimension, resulting in 54 dimensions if all variables are included. I have no other criterion for choosing the variables; it is simply more straightforward with only binary-valued variables.
##
## Call:
## FactoMineR::MCA(X = tea_sup, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.093 0.070 0.061 0.059 0.054 0.050
## % of var. 9.329 6.992 6.105 5.871 5.356 5.050
## Cumulative % of var. 9.329 16.322 22.427 28.298 33.654 38.704
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.045 0.042 0.041 0.038 0.038 0.037
## % of var. 4.527 4.159 4.124 3.840 3.794 3.704
## Cumulative % of var. 43.231 47.390 51.514 55.354 59.148 62.853
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.036 0.033 0.032 0.030 0.029 0.027
## % of var. 3.587 3.253 3.156 2.964 2.892 2.690
## Cumulative % of var. 66.439 69.692 72.848 75.811 78.703 81.393
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.026 0.025 0.023 0.022 0.020 0.020
## % of var. 2.586 2.479 2.263 2.203 1.982 1.976
## Cumulative % of var. 83.979 86.459 88.722 90.925 92.906 94.882
## Dim.25 Dim.26 Dim.27
## Variance 0.019 0.017 0.015
## % of var. 1.851 1.739 1.527
## Cumulative % of var. 96.733 98.473 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.683 1.668 0.536 | 0.139 0.092 0.022 | -0.412
## 2 | -0.459 0.753 0.246 | -0.231 0.254 0.062 | -0.421
## 3 | 0.081 0.024 0.005 | 0.360 0.617 0.103 | -0.142
## 4 | -0.702 1.762 0.368 | -0.184 0.162 0.025 | 0.220
## 5 | -0.263 0.247 0.076 | 0.148 0.105 0.024 | -0.092
## 6 | -0.772 2.128 0.454 | 0.079 0.030 0.005 | -0.227
## 7 | -0.351 0.441 0.163 | 0.289 0.398 0.110 | -0.511
## 8 | 0.055 0.011 0.005 | 0.033 0.005 0.002 | -0.143
## 9 | -0.430 0.662 0.232 | 0.178 0.151 0.040 | -0.321
## 10 | -0.119 0.050 0.016 | 0.132 0.083 0.020 | -0.307
## ctr cos2
## 1 0.927 0.195 |
## 2 0.970 0.207 |
## 3 0.110 0.016 |
## 4 0.265 0.036 |
## 5 0.046 0.009 |
## 6 0.281 0.039 |
## 7 1.424 0.344 |
## 8 0.112 0.033 |
## 9 0.563 0.129 |
## 10 0.513 0.106 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## always | 0.361 1.778 0.068 4.516 | -0.095 0.164 0.005
## Not.always | -0.189 0.930 0.068 -4.516 | 0.050 0.086 0.005
## breakfast | 0.062 0.072 0.004 1.025 | 0.075 0.143 0.005
## Not.breakfast | -0.057 0.067 0.004 -1.025 | -0.069 0.132 0.005
## dinner | -1.257 4.389 0.119 -5.962 | -0.195 0.142 0.003
## Not.dinner | 0.095 0.330 0.119 5.962 | 0.015 0.011 0.003
## evening | 0.293 1.167 0.045 3.659 | 0.674 8.250 0.237
## Not.evening | -0.153 0.610 0.045 -3.659 | -0.352 4.313 0.237
## lunch | 0.451 1.183 0.035 3.232 | 0.807 5.060 0.112
## Not.lunch | -0.077 0.203 0.035 -3.232 | -0.139 0.870 0.112
## v.test Dim.3 ctr cos2 v.test
## always -1.188 | 0.281 1.645 0.041 3.514 |
## Not.always 1.188 | -0.147 0.860 0.041 -3.514 |
## breakfast 1.246 | -0.610 10.830 0.343 -10.132 |
## Not.breakfast -1.246 | 0.563 9.997 0.343 10.132 |
## dinner -0.927 | 0.798 2.706 0.048 3.787 |
## Not.dinner 0.927 | -0.060 0.204 0.048 -3.787 |
## evening 8.421 | 0.321 2.147 0.054 4.015 |
## Not.evening -8.421 | -0.168 1.123 0.054 -4.015 |
## lunch 5.786 | -0.258 0.593 0.011 -1.850 |
## Not.lunch -5.786 | 0.044 0.102 0.011 1.850 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## always | 0.068 0.005 0.041 |
## breakfast | 0.004 0.005 0.343 |
## dinner | 0.119 0.003 0.048 |
## evening | 0.045 0.237 0.054 |
## lunch | 0.035 0.112 0.011 |
## tea.time | 0.280 0.058 0.026 |
## friends | 0.197 0.097 0.048 |
## home | 0.009 0.011 0.047 |
## pub | 0.140 0.010 0.000 |
## resto | 0.128 0.135 0.007 |
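A sketch of the MCA call whose summary is shown above (`tea_sup` being the binary-variable subset, as in the Call line):

```r
library(FactoMineR)
# MCA on the binary-valued subset of the tea data
mca <- FactoMineR::MCA(tea_sup, graph = FALSE)
summary(mca)
```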
Selecting many variables (here 27) makes the MCA plot less readable. It can be observed that many variables are close to each other, but it is hard to tell which ones.
Here I use the original dataset again, this time using the parameters of the MCA call to denote the different types of variables: variables 1-18 are active, variable 19 (age) is a supplementary continuous variable, and variables 20-36 are supplementary categorical variables. Then I use plot.MCA to output several plots of the performed MCA.
The first plot shows the individuals; no clear groupings are visible here.
The second shows the active (in the MCA) categorical variables. It is hard to make judgements here.
The third shows the supplementary variables, which did not contribute to the MCA.
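The MCA with supplementary variables and the three plots can be sketched like this (following the column split described above; `mca2` is a name I chose):

```r
library(FactoMineR)
data(tea)
# variables 1-18 active, age (19) supplementary continuous,
# variables 20-36 supplementary categorical
mca2 <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)
plot.MCA(mca2, invisible = c("var", "quali.sup"))   # individuals only
plot.MCA(mca2, invisible = c("ind", "quali.sup"))   # active categories
plot.MCA(mca2, invisible = c("ind", "var"))         # supplementary categories
```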